This paper discusses the problem of adaptive estimation of a univariate object like the value of a regression function at a given point or a linear functional in a linear inverse problem. We consider an adaptive procedure originated from Lepski [Theory Probab. Appl. 35 (1990) 454-466.] that selects in a data-driven way one estimate out of a given class of estimates ordered by their variability. A serious problem with using this and similar procedures is the choice of some tuning parameters like thresholds. Numerical results show that the theoretically recommended proposals appear to be too conservative and lead to a strong oversmoothing effect. A careful choice of the parameters of the procedure is extremely important for obtaining a reasonable quality of estimation. The main contribution of this paper is a new approach for choosing the parameters of the procedure by providing the prescribed behavior of the resulting estimate in the simple parametric situation. We establish a nonasymptotic "oracle" bound, which shows that the estimation risk is, up to a logarithmic multiplier, equal to the risk of the "oracle" estimate that is optimally selected from the given family. A numerical study demonstrates a good performance of the resulting procedure in a number of simulated examples.

1. Introduction. This paper discusses the problem of selecting one estimate from a given family of estimates $\{\tilde\theta_k, k = 1, \ldots, K\}$ of a univariate object $\theta$. We suppose that every estimate can be represented as

(1.1) $\tilde\theta_k = \theta_k + \zeta_k, \qquad k = 1, \ldots, K,$

where $\theta_k$ is the expectation of $\tilde\theta_k$, $\theta_k = E\tilde\theta_k$, and $\zeta_1, \ldots, \zeta_K$ are zero mean random errors. In what follows we assume that $(\zeta_1, \ldots, \zeta_K)$ is a Gaussian vector with a known covariance matrix $B$. This problem is illustrated by two major examples: estimating a regression function at a given point and estimating a linear functional in a linear inverse problem. In the case of a Gaussian regression model $Y_i = f(X_i) + \varepsilon_i$, the target of estimation is the value of the unknown regression function $f(x)$ at a certain point $x$. The set $\{\tilde\theta_k\}$ can be obtained as kernel or local polynomial estimates with different bandwidths. In the case of a linear inverse problem, the target is usually the value of a linear functional and the family of estimates is obtained by using different values of the regularization parameter for the regularized inversion. Note that the representation (1.1) can be regarded as a reasonable approximation for many other statistical models and problems like regression with non-Gaussian errors or density estimation.

The problem of adaptive estimation can be formulated as the best pos- 
sible choice of one estimate out of this family on the basis of the available 
information. This problem can be viewed as the problem of model selection, 
see, for example, Birge and Massart (1993, 1998), Birge (2006), Juditsky, 
Rigollet and Tsybakov (2008) and references therein. However, there is an 
essential difference between the (global) model selection problem and the 
problem of pointwise estimation considered in this paper. In the problem of 
global model selection one tries to recover the whole underlying model, that 
is, the target is the model itself. Here we consider the problem of recovering 
a one-dimensional characteristic of the whole model like the value of the 
function at a certain point. This makes these two problems quite different. 
In particular, for the problem of pointwise adaptation some additional assumptions on the considered family of estimates are required. Typically one assumes that the given family of "weak" estimates $\tilde\theta_k$ is ordered in the sense that the variance $v_k$ of $\tilde\theta_k$ decreases with $k$. Another intrinsic assumption on the considered set-up is that the squared bias $b_k^2 \stackrel{\mathrm{def}}{=} (\theta_k - \theta)^2$ is small for $k = 1$ and may increase with $k$. The most popular example is given by kernel estimates with different bandwidths, where the starting bandwidth $h_1$ is small, leading to a small bias but a huge variance of estimation. As the bandwidth grows the variance decreases but the bias may increase dramatically. The aim is to construct from the data one estimate that performs in the best possible way and particularly minimizes the corresponding estimation risk.

The first adaptive procedure of this sort was suggested in Lepski (1990) and extended in Lepski (1992) to a much more general set-up. The idea is to select the largest index $k$ such that the estimates $\tilde\theta_1, \ldots, \tilde\theta_k$ do not differ significantly from each other. Two estimates $\tilde\theta_l$ and $\tilde\theta_k$ for $l < k$ differ significantly if the standardized difference $T_{lk} \stackrel{\mathrm{def}}{=} v_l^{-1}(\tilde\theta_l - \tilde\theta_k)^2/2$ exceeds the prescribed threshold $\mathfrak{z}$, which can depend on $l$, $\mathfrak{z} = \mathfrak{z}_l$. Lepski (1990) stated the rate optimality of this procedure over Holder smoothness classes, and Lepski, Mammen and Spokoiny (1997) showed its spatial adaptivity in the sense of rate optimality over Besov functional classes and established some oracle risk bound. Lepski and Spokoiny (1997) proved sharp optimality of a slightly modified procedure in the asymptotic minimax sense. However, all the mentioned results have been established under some conditions on the thresholds $\mathfrak{z}_k$, which basically mean that the thresholds have to be sufficiently large, and they tell nothing if this condition is not fulfilled. At the same time, numerical results in simulated and practical data examples show that applying a large threshold typically leads to a conservative procedure and oversmoothing effects. In this sense, one can say there is a critical gap between the theory and practical applications.

Our paper presents a novel method for selecting the tuning parameters of 
the method based on the so called "propagation" condition, which postulates 
the desirable performance of the method in the simple parametric situation. 
The idea is similar to the problem of hypothesis testing for which the critical 
value of a test is selected by bounding the first-kind error probability under 
the null hypothesis. The theoretical study is done for the adaptive estimate 
with the selected tuning parameters. The main result claims the desired 
oracle risk bound for this defined procedure. The proposed approach seems 
to be quite general and it can be directly applied to many other procedures 
including local model selection, stagewise aggregation and local change-point 
analysis, which are studied in detail in Spokoiny (2009) in a much more 
general set-up. 

Golubev (2004) proposed another "risk envelope" approach to select the 
threshold for a special sequence space model and a particular linear func- 
tional. We consider this example in Section 1.3. The common point between 
Golubev (2004) and our proposal is the selection of the parameters of the 
method by a Monte Carlo simulation from the model with zero response. 
However, the procedure, motivation and theoretical analysis of our study are 
quite different from those in Golubev (2004). 

Theoretical properties of the proposed method are presented in Section 3. 
The main result states the "oracle" property of the proposed estimate: the 
risk of the adaptive estimate exceeds the risk of the "oracle" estimate for 
the given model by at most a logarithmic factor. The results are established 
in a precise nonasymptotic way and in a rather general form. Our simulation 
study in Section 4 confirms a good finite sample performance of the procedure 
for a rather broad class of models and problems. 

Below in this section we present three major examples for which the pro- 
posed procedure can be applied. We start with the problem of pointwise 
bandwidth selection in kernel estimation, then we discuss the problem of 
estimating a linear functional in a linear inverse problem and then specify 
it to one particular functional in the sequence space model. 

1.1. Bandwidth choice in kernel estimation. Consider a regression model $Y_i = f(X_i) + \varepsilon_i$ where $\varepsilon_i$ are i.i.d. Gaussian errors with zero mean and known variance $\sigma^2$ and with a deterministic design $X_1, \ldots, X_n$ in $\mathbb{R}^d$. The considered problem is to estimate the value of the unknown regression function $f(x)$ at a given point $x$. Let a sequence of localizing schemes $W^{(k)} = \{w_i^{(k)}\}$ have been fixed for $k = 1, \ldots, K$. In the case of kernel weights, this sequence is built just by using different values of the bandwidth $h$ from the smallest bandwidth $h_1$ to the largest value $h_K$ in the form $w_i^{(k)} = \psi(|X_i - x|/h_k)$ with a kernel function $\psi(\cdot)$. Every localizing scheme yields the corresponding estimate

$\tilde\theta_k = N_k^{-1} \sum_{i=1}^{n} w_i^{(k)} Y_i, \qquad N_k = \sum_{i=1}^{n} w_i^{(k)}.$

By simple algebra

$\theta_k = E\tilde\theta_k = N_k^{-1} \sum_{i=1}^{n} w_i^{(k)} f(X_i), \qquad \zeta_k = \tilde\theta_k - \theta_k = N_k^{-1} \sum_{i=1}^{n} w_i^{(k)} \varepsilon_i.$

Moreover,

$v_k = \mathrm{Var}(\tilde\theta_k) = \sigma^2 N_k^{-2} \sum_{i=1}^{n} |w_i^{(k)}|^2.$

The above ordering condition can be written for the case of the kernel weights in the form $N_l < N_k$ for $l < k$. Below we will assume an even stronger condition that these values grow exponentially with $k$.
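
The family of kernel estimates and their variances can be sketched numerically as follows. This is only an illustration under stated assumptions: the function name `kernel_family`, the uniform kernel, the sine regression function and the geometric bandwidth grid are hypothetical choices made for the example and are not taken from the paper.

```python
import numpy as np

def kernel_family(X, Y, x, bandwidths, sigma,
                  psi=lambda u: (u <= 1.0).astype(float)):
    """Return (theta_tilde, v): kernel estimates at x and their variances.

    psi defaults to a uniform kernel on [0, 1] (an illustrative choice).
    """
    theta_tilde, v = [], []
    for h in bandwidths:
        w = psi(np.abs(X - x) / h)               # weights w_i^(k)
        N = w.sum()                              # N_k, assumed positive here
        theta_tilde.append(w @ Y / N)            # weighted mean of the responses
        v.append(sigma ** 2 * (w @ w) / N ** 2)  # sigma^2 * sum(w^2) / N_k^2
    return np.array(theta_tilde), np.array(v)

# Hypothetical data: equispaced design, sine regression function.
rng = np.random.default_rng(0)
sigma = 0.3
X = np.linspace(0.0, 1.0, 201)
Y = np.sin(2 * np.pi * X) + sigma * rng.standard_normal(X.size)

bandwidths = 0.02 * 1.5 ** np.arange(6)          # geometric grid h_1 < ... < h_K
theta_tilde, v = kernel_family(X, Y, 0.5, bandwidths, sigma)
```

With a geometric bandwidth grid on an equispaced design the variances `v` decrease roughly geometrically in `k`, which is the kind of ordering the procedure relies on.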



1.2. Estimation of a linear functional in a linear inverse problem. Consider a general set-up of a linear inverse problem when the observed data $Y$ from a Hilbert space $\mathcal{H}_Y$ are modeled by a linear operator equation

(1.2) $Y = AX + \varepsilon,$

where $X$ is the unknown parameter vector from some Hilbert space $\mathcal{H}_X$, $A : \mathcal{H}_X \to \mathcal{H}_Y$ is a linear operator and $\varepsilon$ is a random Gaussian noise in $\mathcal{H}_Y$ with the known correlation structure given by the covariance operator $\Sigma$. The goal is to estimate a linear functional $\theta = \theta(X)$ that can be represented in the form $\langle \vartheta, X \rangle$ for some known element $\vartheta \in \mathcal{H}_X$. Such problems are usually considered as more complex than the usual nonparametric regression estimation due to the poor rate of estimation. Moreover, the difficulty that is usually associated with the attained estimation accuracy increases with the degree of ill-posedness. A naive estimation approach is based on the explicit least-squares solution of the problem (1.2):

$\tilde\theta = \langle \vartheta, (A^*A)^- A^* Y \rangle = \langle A (A^*A)^- \vartheta, Y \rangle = \langle \phi, Y \rangle,$

where $A^*$ is the conjugate operator to $A$, $C^-$ means a pseudo-inverse of $C$ and $\phi = A(A^*A)^- \vartheta$. However, this approach cannot be efficiently applied if $A$ is a compact operator because the inverse of $A^*A$ does not exist or is an unbounded operator. One can regularize the problem if some additional information about smoothness of the element $X$ is available. This allows one to replace $(A^*A)^-$ by its regularization $g_\alpha(A^*A)$ where $g_\alpha$ means some regularized inversion and $\alpha$ is the corresponding parameter. See, for example, Goldenshluger and Pereversev (2003), Goldenshluger (1999) and Goldenshluger and Pereversev (2000) for typical examples. The quality of estimation heavily depends on the choice of the regularization parameter $\alpha$ and its choice is a challenging problem. Usually one fixes a finite ordered set of values $\alpha_1 < \alpha_2 < \cdots < \alpha_K$ and considers the corresponding estimates

$\tilde\theta_k = \langle \phi_k, Y \rangle, \qquad \phi_k = A g_{\alpha_k}(A^*A) \vartheta.$

Now the original problem can be reformulated as follows: given a set of estimates $\tilde\theta_k = \langle \phi_k, Y \rangle$ for known vectors $\phi_k$, build an estimate $\hat\theta$ of the functional $\theta$ that performs nearly as well as the best in this family. We present one particular example for the considered set-up borrowed from Golubev (2004). More examples include a positron emission tomography problem, Cavalier (2001), and functional data analysis, Cai and Hall (2006), among many others.

Our analysis focuses on demonstrating the oracle efficiency of the con- 
structed adaptive procedure rather than on establishing the optimal rate of 
convergence on functional classes. The mentioned efficiency of any adaptive 
(data-driven) method can be measured by the ratio of the risk of the pro- 
posed method to the "oracle" risk which corresponds to the optimal choice 
of the regularization parameter for the model at hand. One message of this note is that the statistical part of the linear inverse problem is actually not harder than in classical nonparametric inference. Moreover, in the inverse problem set-up it is typically easier to perform statistical adaptation because the likelihood profile is not as flat as in classical nonparametric regression. In some examples presented in our simulation study in Section 4, the risk of the adaptive procedure is even smaller than the oracle risk.

1.3. Example for a sequence space model. We consider the statistical problem with observations $y_1, \ldots, y_M$ following the "sequence space" equation

(1.3) $y_i = \mu_i + \sigma_i \varepsilon_i, \qquad i = 1, \ldots, M,$

where $\varepsilon_i$ are independent standard normal and the standard deviations $\sigma_i$ are known while the mean values $\mu_i$ are unknown. The variances $\sigma_i^2$ are usually constant for the regression set-up or grow with $i$ for ill-posed inverse problems.

One particular problem in this set-up can be to estimate the sum

$\theta = \mu_1 + \cdots + \mu_M,$

where $M$ can be equal to infinity assuming that the sum of the $\mu_i$'s is absolutely convergent. The "naive" estimate $\tilde\theta = \sum_{i=1}^{M} y_i$, even for a finite $M$, has a very large variance $\sum_{i=1}^{M} \sigma_i^2$ and hence can be highly inefficient. The smoothing idea leads to the set of the spectral cut-off estimates

$\tilde\theta_k = \langle \phi_k, Y \rangle = y_1 + \cdots + y_{m_k},$

where $\phi_k = (1, \ldots, 1, 0, \ldots, 0)$ is the vector with the first $m_k$ entries equal to one and the others equal to zero, while $m_k$ is a fixed decreasing sequence of finite indices $M > m_1 > m_2 > \cdots > m_K > 1$.

One can easily compute for $k = 1, \ldots, K$ and $l < k$

$\theta_k \stackrel{\mathrm{def}}{=} E\tilde\theta_k = \mu_1 + \cdots + \mu_{m_k}, \qquad \zeta_k \stackrel{\mathrm{def}}{=} \tilde\theta_k - \theta_k = \sigma_1 \varepsilon_1 + \cdots + \sigma_{m_k} \varepsilon_{m_k}$

and $v_k \stackrel{\mathrm{def}}{=} \mathrm{Var}(\tilde\theta_k) = \sigma_1^2 + \cdots + \sigma_{m_k}^2$. The major difficulty in applying the smoothing approach is the proper choice of the parameter $k$ or, equivalently, the cutting point $m_k$. Small values of $k$ lead to a huge variance of the estimate $\tilde\theta_k$ while large $k$-values can result in a big bias $b_k = \theta - \theta_k = \sum_{i=m_k+1}^{M} \mu_i$. The "oracle" choice balances the approximation and stochastic errors. However, this ideal choice assumes that the bias (the approximation error) is known. The problem we consider in this paper is to develop an adaptive (data-driven) choice that mimics the "oracle" and achieves the best possible performance among the set of estimates $\tilde\theta_k$.
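
A small numerical sketch of the spectral cut-off family may help to fix the notation. All concrete values below (the means `mu`, the noise levels `sigma`, the cut-offs `m`) are hypothetical and chosen only for illustration; the "oracle" index is computed here as the minimizer of squared bias plus variance, which is one common way to formalize the balance described above.

```python
import numpy as np

M = 200
i = np.arange(1, M + 1)
mu = 1.0 / i ** 2                       # hypothetical absolutely summable means
sigma = 0.05 * np.sqrt(i)               # hypothetical noise levels growing with i
m = np.array([150, 100, 50, 25, 12, 6, 3])   # decreasing cut-offs m_1 > ... > m_K

rng = np.random.default_rng(1)
y = mu + sigma * rng.standard_normal(M)       # observations from (1.3)

theta = mu.sum()                              # target: mu_1 + ... + mu_M
theta_tilde = np.cumsum(y)[m - 1]             # y_1 + ... + y_{m_k}
theta_k = np.cumsum(mu)[m - 1]                # expectation of theta_tilde_k
v = np.cumsum(sigma ** 2)[m - 1]              # variance of theta_tilde_k
bias = theta - theta_k                        # sum of mu_i over i > m_k

mse = bias ** 2 + v                           # risk of each fixed-k estimate
k_oracle = int(np.argmin(mse)) + 1            # 1-based "oracle" index
```

The oracle index depends on the unknown `mu` through `bias`, which is exactly why a data-driven choice is needed.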

2. Description of the method. This section presents the considered adaptive estimation procedure. We first describe some simple properties of the estimates $\tilde\theta_k$ that will be used in the construction. Then we present the adaptive estimation method.

The definition $\tilde\theta_k = \theta_k + \zeta_k$ for any $k \le K$ yields $E\tilde\theta_k = \theta_k$ and $\mathrm{Var}\,\tilde\theta_k = E\zeta_k^2 = v_k$. Moreover, $\zeta_k$ is a zero-mean Gaussian random variable and for any $r > 0$ and any $\lambda < 1$

(2.1) $E|v_k^{-1}(\tilde\theta_k - \theta_k)^2|^r = c_r,$

(2.2) $E \exp\{\lambda v_k^{-1}(\tilde\theta_k - \theta_k)^2/2\} = (1 - \lambda)^{-1/2},$

where $c_r = E|\xi|^{2r}$ and $\xi$ is standard normal. Due to this result, $\tilde\theta_k$ is a reasonable estimate of $\theta$ if the bias $\theta_k - \theta$ is sufficiently small relative to the standard deviation $v_k^{1/2}$. In particular, in the "no bias" situation $\theta_k = \theta$ the estimate $\tilde\theta_k$ leads to the accuracy of order $v_k^{1/2}$ and one can build confidence intervals for the parameter $\theta_k$ in the form

(2.3) $\mathcal{E}_k(\mathfrak{z}) = \{u : v_k^{-1}|\tilde\theta_k - u|^2/2 \le \mathfrak{z}\}.$

If $\mathfrak{z}$ is sufficiently large, then the result (2.2) ensures that $\mathcal{E}_k(\mathfrak{z})$ contains $\theta_k$ with a high probability.
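
The two moment identities (2.1) and (2.2) are easy to check numerically. The sketch below uses the standard closed form $E|\xi|^{2r} = 2^r \Gamma(r + 1/2)/\sqrt{\pi}$ for a standard normal $\xi$; the function names `c_r` and `exp_moment` and the values of `r` and `lam` are illustrative choices, not notation from the paper.

```python
import math
import numpy as np

def c_r(r: float) -> float:
    """Closed form of c_r = E|xi|^(2r) for a standard normal xi."""
    return 2.0 ** r * math.gamma(r + 0.5) / math.sqrt(math.pi)

def exp_moment(lam: float) -> float:
    """Right-hand side of (2.2): (1 - lam)^(-1/2), valid for lam < 1."""
    return (1.0 - lam) ** -0.5

# Monte Carlo check with the standardized error xi = (theta_tilde_k - theta_k) / sqrt(v_k).
rng = np.random.default_rng(0)
xi = rng.standard_normal(1_000_000)
r, lam = 0.5, 0.3
mc_c_r = np.mean(np.abs(xi) ** (2 * r))        # left-hand side of (2.1)
mc_exp = np.mean(np.exp(lam * xi ** 2 / 2))    # left-hand side of (2.2)
```

Both Monte Carlo averages should agree with `c_r(r)` and `exp_moment(lam)` up to simulation error.
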

2.1. Adaptive choice of an estimate out of a given family. Our starting point is the given family of estimates $\tilde\theta_k$ for $k = 1, \ldots, K$ ordered by their variability so that the variance of $\tilde\theta_k$ decreases with $k$. We aim to select a data-driven index $\hat{k}$ or equivalently the estimate $\hat\theta = \tilde\theta_{\hat{k}}$, which minimizes the corresponding estimation risk.

For a given sequence of estimates $\tilde\theta_k = \theta_k + \zeta_k$ consider the sequence of nested hypotheses $H_k : \theta_1 = \cdots = \theta_k = \theta$. The procedure is sequential: we start with $k = 2$ and at every step $k$ the hypothesis $H_k$ is tested against $H_1, \ldots, H_{k-1}$. If $H_k$ is not rejected then we continue with the next larger $k$. The final estimate corresponds to the latest accepted hypothesis. For testing $H_k$ against $H_l$ with $l < k$, we check that the new estimate $\tilde\theta_k$ belongs to the confidence intervals built on the base of $\tilde\theta_l$. More precisely, we apply the test statistics:

$T_{lk} = v_l^{-1}(\tilde\theta_l - \tilde\theta_k)^2/2, \qquad l < k,$

where $v_l$ is the variance of $\tilde\theta_l$. Big values of $T_{lk}$ indicate a significant difference between the estimates $\tilde\theta_l$ and $\tilde\theta_k$. Due to the definition (2.3), the event $A_{lk} = \{T_{lk} \le \mathfrak{z}_l\}$ means that $\tilde\theta_k$ belongs to the confidence set $\mathcal{E}_l(\mathfrak{z}_l)$ based on $\tilde\theta_l$. The estimate $\tilde\theta_k$ (or the hypothesis $H_k$) is accepted if $H_{k-1}$ was accepted and $T_{lk} \le \mathfrak{z}_l$ for all $l < k$, that is, the new estimate $\tilde\theta_k$ belongs to the intersection of all the confidence intervals $\mathcal{E}_l(\mathfrak{z}_l)$ built on the previous steps of the procedure. The formal definition is given by

$\hat{k} = \max\{k \le K : T_{lm} \le \mathfrak{z}_l, \ \forall l < m \le k\}.$

Here the "critical values" $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ are the parameters of the procedure. Their choice is discussed in Section 2.2.

The selected random index $\hat{k}$ means the largest accepted $k$. The corresponding adaptive estimate $\hat\theta$ is $\tilde\theta_{\hat{k}}$: $\hat\theta = \tilde\theta_{\hat{k}}$. We also define the adaptive estimate $\hat\theta_k$ as the latest accepted after the first $k$ steps:

$\hat\theta_k = \tilde\theta_{\min\{k, \hat{k}\}}.$

The described procedure involves $K - 1$ parameters and their automatic choice is ultimately required for practical applications of the method. Our next step is the method for an automatic selection of the critical values $\mathfrak{z}_k$.
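
The selection rule can be written in a few lines. The following is a minimal sketch assuming the estimates, their variances and the critical values are given as NumPy arrays; the function name `select_index` and the numbers in the example are hypothetical.

```python
import numpy as np

def select_index(theta_tilde, v, z):
    """Return k_hat (1-based): the largest k with T_lm <= z_l for all l < m <= k.

    theta_tilde, v: arrays of length K; z: array of length K - 1.
    """
    K = len(theta_tilde)
    k_hat = 1
    for m in range(2, K + 1):                   # step m in the paper's 1-based notation
        l = np.arange(m - 1)                    # 0-based positions of l = 1, ..., m-1
        T = (theta_tilde[l] - theta_tilde[m - 1]) ** 2 / (2.0 * v[l])
        if np.any(T > z[l]):                    # theta_tilde_m leaves some confidence set
            break
        k_hat = m
    return k_hat

# Hypothetical example: the fourth estimate jumps away from the first three.
theta_tilde = np.array([0.0, 0.1, 0.05, 3.0])
v = np.array([1.0, 0.5, 0.25, 0.125])
z = np.array([2.0, 2.0, 2.0])

k_hat = select_index(theta_tilde, v, z)
theta_hat = theta_tilde[k_hat - 1]              # adaptive estimate
theta_hat_2 = theta_tilde[min(2, k_hat) - 1]    # adaptive estimate after the first 2 steps
```

In this example the first three estimates are mutually compatible, while $T_{14} = 4.5$ exceeds $\mathfrak{z}_1 = 2$, so the procedure stops at the third estimate.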

2.2. Choice of the critical values $\mathfrak{z}_k$ using a "propagation condition." The way of selecting the critical values $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ is similar to the standard approach of hypothesis testing theory: to provide the prescribed performance of the procedure under the simplest (null) hypothesis. In the considered set-up, the null means $\theta_1 = \theta_2 = \cdots = \theta_K = \theta$. We will show below in Theorem 3.3 that the particular value of $\theta$ is unimportant and it suffices to only consider $\theta = 0$. In what follows we denote by $P_0$ the distribution of the data in this situation and $E_0$ means the corresponding mathematical expectation. By the definition of the procedure, accepting $H_k$ for some $k \le K$ yields $\hat\theta_k = \tilde\theta_k$ and rejecting $H_k$ means $\hat\theta_k \ne \tilde\theta_k$. We refer to the latter as a "false alarm" because the procedure terminates in the situation where it should not. If such false alarms occur too often, it is an indication that the critical values $\mathfrak{z}_k$ are not large enough. The usual $\alpha$-level condition on any testing procedure is that under the null it rejects the null hypothesis with probability not exceeding $\alpha$. For the considered multiple test procedure this condition reads as $P_0(\hat\theta_k \ne \tilde\theta_k) \le \alpha$. We slightly modify this condition to adapt it to the problem of adaptive estimation by selecting a polynomial loss function instead of the indicator of the error decision. Rejecting the null hypothesis happens if $v_l^{-1}(\tilde\theta_l - \tilde\theta_k)^2/2 > \mathfrak{z}_l$, which can be interpreted as the estimate $\tilde\theta_k$ not belonging to the confidence interval $\mathcal{E}_l(\mathfrak{z}_l)$ built on the base of $\tilde\theta_l$. In the testing problem it only matters how often such false alarms occur. In the considered problem of adaptive estimation we focus on the risk associated with such a false alarm. Therefore, the particular indices $l, k$ and the size of $v_l^{-1}(\tilde\theta_l - \tilde\theta_k)^2$ matter as well. Suppose that some loss power $r > 0$ is fixed. By (2.1)

$E_0|v_k^{-1}(\tilde\theta_k - \theta)^2|^r = c_r, \qquad \text{for all } k \le K,$

where $c_r = E|\xi|^{2r}$ and $\xi$ is standard normal. We require that the parameters $\mathfrak{z}_1, \ldots, \mathfrak{z}_{K-1}$ of the procedure are selected in such a way that

(2.4) $E_0|v_k^{-1}(\tilde\theta_k - \hat\theta_k)^2|^r \le \alpha c_r, \qquad k = 2, \ldots, K.$

The meaning of this condition is that at every step $k$ of the procedure, the risk associated with false alarms is at most an $\alpha$-fraction of the best-possible estimation risk. Here $\alpha$ is the preselected constant, which is similar to the confidence level of a testing procedure. This gives us $K - 1$ conditions to fix $K - 1$ parameters. As in the testing problem, we are interested in selecting the critical values as small as possible under the constraint (2.4). Note that the choice of $r$ very close to zero leads back to the indicator loss function $\mathbf{1}(\hat\theta_k \ne \tilde\theta_k)$ and thus, to the usual error of the first kind for the multiple testing procedure.

Our definition still involves two parameters $\alpha$ and $r$. It is important to mention that their choice is subjective and there is no way for an automatic selection in the considered local or pointwise set-up. Moreover, the possibility of tuning such parameters in particular applications is an important advantage of the approach. Our aim is to develop a procedure that combines and balances two important features: stability in the parametric situation and sensitivity under deviations from the parametric null hypothesis. The propagation condition (2.4) is exactly a constraint on the stability in the parametric case, and we aim to optimize the sensitivity of the method under this constraint. A proper choice of the power $r$ for the loss function as well as the "confidence level" $\alpha$ depends on the particular application and on the additional subjective requirements to the procedure. Taking a large $r$ and small $\alpha$ would result in an increase of the critical values and therefore improves the performance of the method in the parametric situation at the cost of some loss of sensitivity to deviations from the parametric situation. This behavior is analogous to the hypothesis testing problem where a small $\alpha$ reduces the first-kind error at the cost of the test's power. Theorem 3.1 presents some upper bounds for the critical values $\mathfrak{z}_k$ as functions of $\alpha$ and $r$ in the form $a_1 \log K + a_2\{\log \alpha^{-1} + r(K - k)\}$ with some coefficients $a_1$ and $a_2$. We see that these bounds linearly depend on $r$ and on $\log \alpha^{-1}$. For our examples, we apply a relatively small value $r = 1/2$. We also apply $\alpha = 1$ although the other values in the range $[0.5, 1]$ lead to very similar results. It is worth mentioning that both the procedure and the theoretical study apply and lead to reasonable results whatever $r$ and $\alpha$ are. This makes a striking difference with many other proposals for selecting the tuning parameter(s); see the references in the introduction. Typically one requires that the critical values (thresholds) $\mathfrak{z}$ are sufficiently large and the theory is only valid under this condition.

2.3. Sequential choice. The set of conditions (2.4) does not directly define the critical values $\mathfrak{z}_k$. We present below one sequential method for fixing $\mathfrak{z}_k$ one after another starting from $\mathfrak{z}_1$. The idea is to provide that the relative impact of each $\mathfrak{z}_k$ in the total risk in (2.4) is the same for every $k \le K - 1$. We start with $\mathfrak{z}_1$ and set $\mathfrak{z}_2 = \cdots = \mathfrak{z}_{K-1} = \infty$. This effectively means that every new estimate $\tilde\theta_k$ is only compared with $\tilde\theta_1$. We run the procedure with such critical values. The resulting adaptive estimate after step $k$ is denoted by $\hat\theta_k(\mathfrak{z}_1)$. We select $\mathfrak{z}_1$ as the minimal value providing

(2.5) $E_0|v_k^{-1}(\hat\theta_k(\mathfrak{z}_1) - \tilde\theta_k)^2|^r \le \frac{1}{K-1} \alpha c_r, \qquad k = 2, \ldots, K.$

Such a value exists because the choice $\mathfrak{z}_1 = \infty$ leads to $\hat\theta_k = \tilde\theta_k$ for all $k$.

Similarly, we specify $\mathfrak{z}_2$ by considering the situation with the previously fixed $\mathfrak{z}_1$, some finite $\mathfrak{z}_2$ and all the remaining critical values equal to infinity, and so on. For the formal definition, suppose that $\mathfrak{z}_1, \ldots, \mathfrak{z}_{m-1}$ have been already fixed for some $m > 1$ and define for any $\mathfrak{z}_m$ the adaptive estimates $\hat\theta_k(\mathfrak{z}_1, \ldots, \mathfrak{z}_m)$ for $k > m$, which come out of the procedure with the critical values $(\mathfrak{z}_1, \ldots, \mathfrak{z}_m, \infty, \ldots, \infty)$. We select $\mathfrak{z}_m$ as the minimal value providing

(2.6) $E_0|v_k^{-1}\{\hat\theta_k(\mathfrak{z}_1, \ldots, \mathfrak{z}_m) - \tilde\theta_k\}^2|^r \le \frac{m}{K-1} \alpha c_r, \qquad k = m + 1, \ldots, K.$

Such a value exists because the choice $\mathfrak{z}_m = \infty$ leads to $\hat\theta_k(\mathfrak{z}_1, \ldots, \mathfrak{z}_m) = \hat\theta_k(\mathfrak{z}_1, \ldots, \mathfrak{z}_{m-1})$ and even a stronger condition has been already checked at the previous step.

The condition (2.5) describes the impact of the first critical value in the risk (2.4) while (2.6) describes the accumulated impact of the first $m$ critical values. The factor $m/(K - 1)$ in the right-hand side of (2.6) is chosen to ensure that every critical value $\mathfrak{z}_k$ has the same impact.

Our construction guarantees that the selected sequence $\mathfrak{z}_k$ is minimal under the set of conditions (2.6) in the sense that one cannot select another sequence $\mathfrak{z}'_k \le \mathfrak{z}_k$ for all $k$ such that (2.6) is still fulfilled. Indeed, let $\{\mathfrak{z}'_k\}$ be another sequence that ensures (2.6) and let $m$ be the first index for which $\mathfrak{z}'_m < \mathfrak{z}_m$. Then the condition

$E_0|v_k^{-1}\{\hat\theta_k(\mathfrak{z}'_1, \ldots, \mathfrak{z}'_{m-1}, \mathfrak{z}'_m) - \tilde\theta_k\}^2|^r \le \frac{m}{K-1} \alpha c_r, \qquad k > m,$

on $\mathfrak{z}'_m$ is even stronger than (2.6) and one cannot select $\mathfrak{z}'_m < \mathfrak{z}_m$ to ensure it. This contradiction shows minimality of the sequence $\mathfrak{z}_k$.

An explicit form for the critical values $\mathfrak{z}_k$ is not available but they can be easily computed using Monte Carlo simulations from the null hypothesis; see Section 4 for details.
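
The sequential choice (2.5)-(2.6) can be sketched as a Monte Carlo routine. The code below is an illustration under stated assumptions and not the implementation used for the paper's simulations: the function names `mc_risks` and `calibrate`, the search over a fixed grid of candidate values (so that "minimal" means minimal on that grid), and the example covariance matrix `B` are all hypothetical choices.

```python
import math
import numpy as np

def mc_risks(theta, T, v, z, r):
    """Monte Carlo estimate of E_0 |(theta_hat_k - theta_tilde_k)^2 / v_k|^r, k = 1..K.

    theta: (n, K) samples under the null; T: (n, K, K) with T[s, l, m] = T_lm
    for l < m and 0 otherwise; z: (K-1,) critical values, np.inf = never reject.
    """
    n, K = theta.shape
    z_full = np.append(z, np.inf)                        # pad to length K
    rejected = (T > z_full[None, :, None]).any(axis=1)   # rejected[s, m]: some T_lm > z_l
    first = rejected.argmax(axis=1)                      # first rejected step, if any
    k_hat = np.where(rejected.any(axis=1), first - 1, K - 1)  # last accepted index (0-based)
    idx = np.minimum(k_hat[:, None], np.arange(K)[None, :])   # min(k_hat, k)
    theta_hat = np.take_along_axis(theta, idx, axis=1)        # theta_hat_k for every k
    return np.mean(np.abs((theta_hat - theta) ** 2 / v) ** r, axis=0)

def calibrate(B, alpha=1.0, r=0.5, n_sim=20000, grid=None, seed=0):
    """Fix z_1, ..., z_{K-1} one after another following (2.5)-(2.6)."""
    B = np.asarray(B, dtype=float)
    K = B.shape[0]
    v = np.diag(B).copy()
    rng = np.random.default_rng(seed)
    theta = rng.multivariate_normal(np.zeros(K), B, size=n_sim)  # null: all theta_k = 0
    diff2 = (theta[:, :, None] - theta[:, None, :]) ** 2
    T = np.triu(diff2 / (2.0 * v[None, :, None]), k=1)           # keep only l < m
    c_r = 2.0 ** r * math.gamma(r + 0.5) / math.sqrt(math.pi)    # c_r = E|xi|^(2r)
    if grid is None:
        grid = np.linspace(0.0, 15.0, 61)                        # assumed search range
    grid = np.sort(np.asarray(grid, dtype=float))
    z = np.full(K - 1, np.inf)
    for m in range(K - 1):                      # z[m] is z_{m+1} in 1-based notation
        bound = (m + 1) / (K - 1) * alpha * c_r
        for cand in grid:                       # ascending: first admissible value wins
            z[m] = cand
            if np.all(mc_risks(theta, T, v, z, r)[m + 1:] <= bound):
                break                           # if none is admissible, z[m] stays at grid[-1]
    return z, mc_risks(theta, T, v, z, r), c_r

# Hypothetical covariance structure of nested cut-off estimates: Cov = min(v_l, v_k).
v = 2.0 ** -np.arange(5)
B = np.minimum.outer(v, v)
z, risk, c_r = calibrate(B, alpha=1.0, r=0.5, n_sim=4000)
```

The returned `risk` is the Monte Carlo version of the left-hand side of (2.4) under the calibrated values, so comparing `risk[1:]` with `alpha * c_r` gives a quick check of the propagation condition on the simulated sample. A finer grid or a root search could replace the grid scan; the grid is used here only because it does not require the simulated risk to be monotone in the candidate value.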

3. Theoretical study. This section presents some properties of the adaptive estimate $\hat\theta$ of the target value $\theta$. We suppose that the parameters $\mathfrak{z}_k$ of the procedure are selected in such a way that the condition (2.4) is fulfilled. The main result is the "oracle" property of the adaptive estimate $\hat\theta$, which claims that the risk of adaptive estimation is up to some multiplier as good as the risk of the ideal (oracle) estimate. This multiplier is directly related to the applied critical values $\mathfrak{z}_k$ and in typical situations it is at most logarithmic in the sample size. In the proof we distinguish between three cases: parametric, local parametric and nonparametric. The parametric case means that $\theta_k \stackrel{\mathrm{def}}{=} E\tilde\theta_k = \theta$ for all $k \le K$. This case can be easily reduced to the null hypothesis $\theta_1 = \cdots = \theta_K = 0$ and the oracle property of the adaptive estimate $\hat\theta$ is ensured by the construction, more precisely, by the propagation condition (2.4). The local parametric case means that $\theta_1 = \cdots = \theta_k = \theta$ holds for some $k \le K$. In this case, the construction ensures the oracle property for the adaptive estimate $\hat\theta_k$ obtained after the first $k$ steps of the procedure. Then we show that a similar oracle property of the estimate $\hat\theta_k$ can be obtained in the nonparametric situation under the so-called "small modeling bias" condition. This condition is used to give a formal definition of the oracle choice. The final oracle result for the adaptive estimate $\hat\theta$ is obtained by combining the previously established "propagation" result under the small modeling bias condition with the "stability" property, which is ensured by the adaptive procedure itself.

3.1. Bounds for the critical values. This section presents some upper and lower bounds for the critical values $\mathfrak{z}_k$. The results are established under the following condition on the variances $v_k$.

(MD) for some constants $u_0, u$ with $1 < u_0 \le u$, the variances $v_k$ satisfy
$v_{k-1} \le u v_k, \qquad u_0 v_k \le v_{k-1}, \qquad 2 \le k \le K.$

We also denote for $l < k \le K$

$v_{l,k} \stackrel{\mathrm{def}}{=} \mathrm{Var}(\tilde\theta_l - \tilde\theta_k).$

Our first result presents some upper bound for the parameters $\mathfrak{z}_k$ under condition (MD). The proof is given in the Appendix.

Theorem 3.1. Assume (MD). Let $\gamma$ be such that for all $l < k \le K$

(3.1) $v_{l,k}/v_l \le \gamma.$

Then there is a constant $C_1$ depending on $r$, $u_0$ and $u$ only such that the choice

$\mathfrak{z}_k = \gamma\{\log \alpha^{-1} + r \log(v_k/v_K)\} + C_1 \log K$

ensures (2.4) for all $k \le K$. Particularly, $E_0|v_K^{-1}(\tilde\theta_K - \hat\theta)^2|^r \le \alpha c_r$.

Remark 3.1. The result of Theorem 3.1 presents some upper bounds for the critical values. These upper bounds will be used for our theoretical study; however, they do not appear in the proposed adaptive procedure. An interesting observation is that these upper bounds linearly decrease with $k$. Indeed, by condition (MD) $\log(v_k/v_K) \le (K - k) \log u$ and $\log(v_k/v_K) \ge (K - k) \log u_0$. The reason for a decrease of $\mathfrak{z}_k$ with $k$ can be explained as follows. Under the null hypothesis the procedure should not terminate at intermediate steps and the oracle estimate is $\tilde\theta_K$. An early stop ("false alarm") $\hat{k} = k$ for $k < K$ results in selecting the estimate $\tilde\theta_k$, which has much larger variability than $\tilde\theta_K$. The smaller $k$ is, the larger is the associated loss in the estimation quality. Therefore, the test at the early stage of the procedure should be rather conservative while a "false alarm" at the final steps of the procedure is not so critical, and we are more interested in improving sensitivity by applying nonconservative critical values.

Our next result shows that the linear growth of the critical values $\mathfrak{z}_k$ with $K - k$ is not only sufficient but also necessary for providing (2.4). To highlight the contribution of every particular value $\mathfrak{z}_k$, we consider the situation when all the previous parameters are equal to infinity: $\mathfrak{z}_1 = \cdots = \mathfrak{z}_{k-1} = \infty$. This effectively means that the procedure cannot terminate at the first $k - 1$ steps due to a possibly wrong choice of the corresponding critical values.

Theorem 3.2. Assume (MD). Suppose that for a fixed $k < K$, it holds $\mathfrak{z}_1 = \cdots = \mathfrak{z}_{k-1} = \infty$. Then the condition (2.4) implies that

$\mathfrak{z}_k \ge \frac{v_{k,k+1}}{v_k}\{r \log(v_{k,K}/v_K) + \log \alpha^{-1} - C_2 \log K\}$

for some positive constant $C_2$ depending on $r$, $u$, $u_0$ only.

The proof is again moved to the Appendix.

Remark 3.2. Our main oracle result particularly shows that the leading term in the risk linearly depends on the value $\mathfrak{z}_{k^*}$ where $k^*$ is the optimally selected index. Therefore, obtaining a sharp oracle result would require bringing together the upper and lower bounds for the critical values $\mathfrak{z}_k$. In our results these two bounds differ by the factor $\gamma v_k/v_{k,k+1}$ with $\gamma$ from (3.1). The value $\gamma$ is usually close to one because the estimates $\tilde\theta_k$ are positively correlated with each other in most cases. However, the value $v_{k,k+1}/v_k$ can be small for the same reason. So, obtaining a sharp oracle result would require some modification of the presented procedure; cf. Lepski and Spokoiny (1997). Further discussion of this issue lies beyond the scope of this paper.

3.2. Behavior in the local parametric situation. The parametric situation can be understood as the case when $\theta_1 = \theta_2 = \cdots = \theta_K$. In this case the estimate $\tilde\theta_K$ is unbiased and has the smallest variance and hence, the smallest risk described by the formula $E|v_K^{-1}(\tilde\theta_K - \theta)^2|^r = c_r$. A natural requirement to any adaptive procedure is to provide a similar accuracy of the adaptive estimate under the parametric hypothesis. Similarly, the local parametric situation corresponds to the case when $\theta_1 = \cdots = \theta_k = \theta$ for some $k \le K$. In this case it is natural to require that the adaptive estimate $\hat\theta_k$ after $k$ steps is close to its nonadaptive counterpart $\tilde\theta_k$. This property is actually provided by the construction of the critical values.

Theorem 3.3. Let $\theta_1 = \theta_2 = \cdots = \theta_K = \theta$. Then it holds

$E|v_K^{-1}(\hat\theta - \tilde\theta_K)^2|^r \le \alpha c_r.$

Moreover, if $\theta_1 = \theta_2 = \cdots = \theta_k = \theta$ for some $k \le K$, then

$E|v_k^{-1}(\hat\theta_k - \tilde\theta_k)^2|^r \le \alpha c_r.$

Proof. Only the differences $\tilde\theta_l - \tilde\theta_k$ appear in the definition of the test statistics $T_{lk}$. In view of the decomposition $\tilde\theta_k = \theta + \zeta_k$, the value $\theta$ cancels there. Similarly, the adaptive estimate $\hat\theta_k$ coincides with one of $\tilde\theta_1, \ldots, \tilde\theta_k$ and the value $\theta$ cancels in the difference $\hat\theta_k - \tilde\theta_k$ as well. Hence, we can assume $\theta = 0$ and $\tilde\theta_k = \zeta_k$. Then the results follow from the constraints (2.4) on the critical values $\mathfrak{z}_k$. □

3.3. "Small modeling bias" condition and "propagation" property. Theorem 3.3 describes the performance of the estimate $\hat\theta_k$ under the parametric or local parametric assumption. Now we aim to extend this result to the general nonparametric situation when the identities $\theta_1 = \theta_2 = \cdots = \theta_k = \theta$ are only approximately fulfilled and the deviation from the null hypothesis $H_k$ is not significant.

As mentioned in Section 2.2, the choice of critical values $\mathfrak{z}_k$ is determined by the joint distribution of the test statistics $T_{lk} = v_l^{-1}(\tilde\theta_l - \tilde\theta_k)^2/2$ under the measure $P_0$ corresponding to the parametric hypothesis $\theta_1 = \theta_2 = \cdots = \theta_K = 0$. An extension of this result to the nonparametric situation leads to considering the similar distribution in the general case. Let $P_k$ mean the joint distribution of $\tilde\theta(k) = (\tilde\theta_1, \ldots, \tilde\theta_k)^\top$ for $k \ge 1$. By the model assumption, this is a Gaussian vector with $E\tilde\theta(k) = \theta(k) = (\theta_1, \ldots, \theta_k)^\top$. Let also $B_k$ be the covariance matrix of the vector $\tilde\theta(k)$. Then $P_k$ is the normal distribution with the mean $\theta(k)$ and the covariance matrix $B_k$, $P_k = \mathcal{N}(\theta(k), B_k)$. Similarly, $P_{\theta,k}$ denotes the distribution of $\tilde\theta(k)$ under the local parametric situation $\theta_1 = \cdots = \theta_k = \theta$, that is, $P_{\theta,k} = \mathcal{N}(\theta_0(k), B_k)$, where $\theta_0(k) = (\theta, \ldots, \theta)^\top$. Let $b(k) = (b_1, \ldots, b_k)^\top$ with $b_k = \theta_k - \theta$.

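Lemma 3.1. For $k \ge 1$, define

$$\Delta_k \stackrel{\mathrm{def}}{=} b^\top(k) B_k^{-1} b(k).$$

Then the Kullback-Leibler divergence $\mathcal{K}(P_k, P_{\theta,k})$ fulfills

$$\mathcal{K}(P_k, P_{\theta,k}) = E_k \log\Bigl(\frac{dP_k}{dP_{\theta,k}}\Bigr) = \Delta_k/2$$

and the values $\Delta_k$ grow with $k$. It also holds for any $s > 1$

$$\frac{1}{s}\log E_{\theta,k}\Bigl(\frac{dP_k}{dP_{\theta,k}}\Bigr)^{s} = \frac{\Delta_k(s-1)}{2}.$$

Moreover, if $\zeta$ is a measurable function of $\tilde\theta_1, \dots, \tilde\theta_k$, then with $s' = s/(s-1)$

$$E\zeta \le (E_{\theta,k}\zeta^{s'})^{1/s'} \exp\{\Delta_k(s-1)/2\}.$$

In particular, for $s = 2$ it holds $E\zeta \le (e^{\Delta_k} E_{\theta,k}\zeta^2)^{1/2}$.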

Proof. Define $Z_k = dP_k/dP_{\theta,k}$. Then

$$\log Z_k = b^\top(k) B_k^{-1/2}\xi_k + b^\top(k) B_k^{-1} b(k)/2$$

with $\xi_k \sim \mathcal{N}(0, I_k)$ under $P_k$, and hence $E_k\log(Z_k) = \Delta_k/2$. Therefore, $\Delta_k$ is twice the
Kullback-Leibler divergence between two measures $P_k$ and $P_{\theta,k}$ obtained by
projecting the measures $P$ and $P_\theta$ on the $\sigma$-field generated by $\tilde\theta_1, \dots, \tilde\theta_k$, and
this $\sigma$-field grows with $k$. This immediately implies that the values $\Delta_k$ increase
monotonically with $k$, that is, $\Delta_k \le \Delta_{k'}$ for $k < k'$. Similarly,

$$E_{\theta,k} Z_k^s = E_{\theta,k}\exp\{s\,b^\top(k) B_k^{-1/2}\xi_k - b^\top(k) B_k^{-1} b(k)\,s/2\}
= \exp\{b^\top(k) B_k^{-1} b(k)(s^2 - s)/2\}.$$

Next, let $\zeta$ be a measurable function of the vector $\tilde\theta(k)$. It holds $E\zeta = E_{\theta,k}\zeta Z_k$.
By the Hölder inequality

$$E_{\theta,k}\zeta Z_k \le (E_{\theta,k}\zeta^{s'})^{1/s'}(E_{\theta,k}Z_k^s)^{1/s}$$

and the assertion follows. □
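
The identities of Lemma 3.1 can be checked numerically. The following is only a sketch under stated assumptions: the variances `v`, the means `theta_k` and the nested covariance structure $\mathrm{Cov}(\tilde\theta_j, \tilde\theta_l) = v_{\max(j,l)}$ are hypothetical choices made for illustration, and the helper names `modeling_bias` and `kl_equal_cov` are not from the paper. The code computes $\Delta_k = b^\top(k)B_k^{-1}b(k)$ on the leading blocks and compares it with twice the Kullback-Leibler divergence between two Gaussian laws with a common covariance matrix.

```python
import numpy as np

def modeling_bias(b, B):
    """Return Delta_k = b(k)^T B_k^{-1} b(k) for k = 1, ..., K."""
    K = len(b)
    return np.array([b[:k] @ np.linalg.solve(B[:k, :k], b[:k])
                     for k in range(1, K + 1)])

def kl_equal_cov(mean_p, mean_q, cov):
    """KL(N(mean_p, cov) || N(mean_q, cov)) = 0.5 * d^T cov^{-1} d."""
    d = mean_p - mean_q
    return 0.5 * d @ np.linalg.solve(cov, d)

# Hypothetical nested family: variances decrease with k and
# Cov(theta_j, theta_l) = v_max(j, l), as for nested partial sums.
v = np.array([8.0, 4.0, 2.0, 1.0])
idx = np.arange(len(v))
B = v[np.maximum.outer(idx, idx)]

theta = 0.0                                # assumed target value
theta_k = np.array([0.1, 0.3, 0.8, 1.5])   # assumed means of the estimates
b = theta_k - theta                        # biases b_k

deltas = modeling_bias(b, B)
kl = np.array([kl_equal_cov(theta_k[:k], np.full(k, theta), B[:k, :k])
               for k in range(1, len(v) + 1)])

print(deltas)    # nondecreasing in k
print(2 * kl)    # coincides with deltas
```

In this toy configuration the sequence `deltas` is nondecreasing, in line with the monotonicity claim of the lemma.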

Due to Lemma 3.1, the value $\Delta_k$ can be used to measure the distance
between the two models: one corresponds to the local parametric situation
with $\theta_1 = \theta_2 = \cdots = \theta_k = \theta$ and the other describes the distribution of the
same vector $\tilde\theta(k)$ in the general nonparametric situation. We call this value
$\Delta_k$ the modeling bias because it describes how much we have to pay in the
risk for using the "wrong" parametric model in place of the underlying
nonparametric one. The "small modeling bias" (SMB) condition simply means
that the value $\Delta_k$ does not exceed some sufficiently small value $\Delta$.

The result of Lemma 3.1 implies that the bound for the risk of estimation
$E_\theta\{v_k^{-1}(\hat\theta_k - \theta)^2\}^{r}$ under the parametric hypothesis translates under the
SMB condition $\Delta_k \le \Delta$ into the bound for the risk $E\{v_k^{-1}(\hat\theta_k - \theta)^2\}^{r/s'}$.
Similarly one can bound $E\{v_k^{-1}(\tilde\theta_k - \hat\theta_k)^2\}^{r/s'}$.

In what follows we apply the result of Lemma 3.1 with s = s' = 2, which 
nicely simplifies the notation. Note, however, that any s > 1 can be used. 
For instance, taking a large s leads to the value of s' close to one. 

Theorem 3.4. For any $r > 0$, it holds for every $k \le K$

$$E|v_k^{-1}(\tilde\theta_k - \hat\theta_k)^2|^{r/2} \le \sqrt{\alpha C_r e^{\Delta_k}}.$$

The bound follows directly from Lemma 3.1 (applied with $s = 2$) and Theorem 3.3.

We call this result the "propagation" property because it ensures that with
a high probability the procedure does not terminate, yielding $\hat\theta_k = \tilde\theta_k$, as long
as the SMB condition $\Delta_k \le \Delta$ is fulfilled. Note that a similar property has
been proved for the original procedure in Lepski (1990) and Lepski (1991,
1992), however, under the additional condition that the critical values $\mathfrak{z}_k$
are sufficiently large. We instead use the propagation condition (2.4) and
the SMB condition.

3.4. "Stability after propagation" and oracle results. Due to the
"propagation" result of Theorem 3.4, the procedure performs well as long as the
SMB condition is fulfilled, which means that the value $\Delta_k$ remains bounded
by some (small) constant. We formalize this condition in the form $\Delta_k \le \Delta$.
Here $\Delta$ is an arbitrary number that will determine the oracle choice. We will
show in Section 3.6 that in typical situations this value $\Delta$ is similar to the
ratio of the squared bias to the variance of $\tilde\theta_k$. Note, however, that the value
$\Delta$ only appears in our theoretical study; it does not affect the procedure.
The results apply whatever $\Delta > 0$.

To establish the accuracy result for the final estimate $\hat\theta$, we have to check
that the adaptive estimate $\hat\theta_k$ does not vary much at the steps at which the
modeling bias $\Delta_k$ becomes large.

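Theorem 3.5 (Stability). It holds for every $k \le K$

(3.2) $$v_k^{-1}(\tilde\theta_k - \hat\theta)^2\,\mathbf{1}(\hat k > k) \le 2\mathfrak{z}_k.$$

Proof. The result follows from the definitions $\hat\theta = \tilde\theta_{\hat k}$ and $\hat\theta_k = \tilde\theta_{\min\{k,\hat k\}}$,
because $\hat k$ is accepted and $\min\{k, \hat k\} \le \hat k$. □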
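
The stability bound is a deterministic consequence of the acceptance rule, which a short sketch makes visible. The rule coded below is an assumption made for illustration: step $m$ is accepted as long as $v_l^{-1}(\tilde\theta_l - \tilde\theta_m)^2/2 \le \mathfrak{z}_l$ for every $l < m$, and the procedure stops at the first rejection. The function name `select_index`, the variances, the critical values and the simulated estimates are all hypothetical.

```python
import numpy as np

def select_index(theta_tilde, v, z):
    """Return the 0-based selected index k_hat.

    Assumed rule: step m is accepted while, for every l < m,
    (theta_l - theta_m)^2 / (2 v_l) <= z_l; stop at the first rejection.
    """
    K = len(theta_tilde)
    k_hat = 0
    for m in range(1, K):
        t = (theta_tilde[:m] - theta_tilde[m]) ** 2 / (2.0 * v[:m])
        if np.any(t > z[:m]):
            break
        k_hat = m
    return k_hat

rng = np.random.default_rng(0)
v = 2.0 ** -np.arange(8)                 # hypothetical decreasing variances
z = np.full(7, 3.0)                      # hypothetical critical values z_1..z_7
theta_tilde = rng.normal(0.0, np.sqrt(v))

k_hat = select_index(theta_tilde, v, z)
# Stability: for every accepted k < k_hat the normalized squared distance
# between theta_k and the final estimate is at most 2 * z_k.
dist = (theta_tilde[:k_hat] - theta_tilde[k_hat]) ** 2 / v[:k_hat]
print(k_hat, bool(np.all(dist <= 2.0 * z[:k_hat])))
```

The check holds for any input data, not only for this draw, because it merely restates the tests that $\hat k$ has passed.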

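Combination of the "propagation" and "stability" statements implies the
main result concerning the properties of the adaptive estimate $\hat\theta$. In the
formulation of this and the further results we assume some constant $\Delta > 0$
to be fixed. We also assume that our set-up is reasonable in the sense that for
the very first model the SMB condition $\Delta_1 \le \Delta$, or equivalently, $b_1^2 \le \Delta v_1$,
is fulfilled. This enables us to correctly define the ideal index $k^*$.

Theorem 3.6. Let $k^*$ be the maximal value $k$ such that $\Delta_k \le \Delta$. Then

(3.3) $$E|v_{k^*}^{-1}(\tilde\theta_{k^*} - \hat\theta)^2|^{r/2} \le \sqrt{\alpha C_r e^{\Delta}} + (2\mathfrak{z}_{k^*})^{r/2}.$$

Proof. The events $\{\hat k > k^*\}$ and $\{\hat k \le k^*\}$ do not overlap and $\hat\theta = \hat\theta_{k^*}$
for $\hat k \le k^*$. This yields the representation

$$E|v_{k^*}^{-1}(\tilde\theta_{k^*} - \hat\theta)^2|^{r/2} = E|v_{k^*}^{-1}(\tilde\theta_{k^*} - \hat\theta)^2|^{r/2}\,\mathbf{1}(\hat k > k^*) + E|v_{k^*}^{-1}(\tilde\theta_{k^*} - \hat\theta_{k^*})^2|^{r/2}\,\mathbf{1}(\hat k \le k^*).$$

Now the result follows from Theorems 3.4 and 3.5. □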
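
For illustration, the oracle index and the right-hand side of the oracle bound can be evaluated as follows. This is a minimal sketch: the sequence of modeling biases and the critical value passed to the functions are made-up numbers, and the helper names `oracle_index` and `oracle_bound` are not part of the paper.

```python
import math

def oracle_index(deltas, Delta):
    """Largest 1-based k with Delta_k <= Delta.

    Assumes Delta_1 <= Delta, as required in the text.
    """
    k_star = 0
    for k, d in enumerate(deltas, start=1):
        if d <= Delta:
            k_star = k
    if k_star == 0:
        raise ValueError("SMB condition fails already for the first model")
    return k_star

def oracle_bound(alpha, C_r, Delta, z_kstar, r):
    """Right-hand side of the oracle bound: sqrt(alpha*C_r*e^Delta) + (2 z)^(r/2)."""
    return math.sqrt(alpha * C_r * math.exp(Delta)) + (2.0 * z_kstar) ** (r / 2.0)

deltas = [0.1, 0.4, 0.9, 2.5]        # hypothetical modeling biases Delta_k
k_star = oracle_index(deltas, Delta=1.0)
bound = oracle_bound(alpha=1.0, C_r=1.0, Delta=1.0, z_kstar=2.5, r=1.0)
print(k_star, round(bound, 4))
```

With these inputs the second term $\sqrt{2\mathfrak{z}_{k^*}}$ already dominates the first one, which matches the later remark that the critical values drive the leading term of the risk bound.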

3.5. Discussion. Here we discuss some issues related to the stated oracle
result.

"Oracle" quality. Theorem 3.4 ensures that the estimation loss $v_k^{-1}(\hat\theta_k - \theta)^2$
is bounded with a high probability if the modeling bias $\Delta_k$ is not too
big. The oracle choice $k^*$ is the largest one for which the SMB condition
$\Delta_k \le \Delta$ holds, leading to the accuracy $|\tilde\theta_{k^*} - \theta|$ of order $v_{k^*}^{1/2}$. We aim to
build an adaptive estimate that delivers the same quality as the oracle one.
Theorem 3.6 claims that the difference $\hat\theta - \tilde\theta_{k^*}$ between the adaptive estimate
$\hat\theta$ and the oracle is indeed of order $v_{k^*}^{1/2}$ up to the factor $\sqrt{2\mathfrak{z}_{k^*}}$.

The "true" value $\theta$. The "true" value $\theta$ is not explicitly shown in the
oracle inequality (3.3). It only enters in the definition of the modeling bias
$\Delta_k$ and thus, in the SMB condition $\Delta_k \le \Delta$ and in the definition of the
oracle choice $k^*$. The oracle bound just compares the optimal choice of the
index $k^*$ for the given nonparametric model (1.1) with the adaptive index
$\hat k$. In fact, the model (1.1) does not require a "true" value $\theta$ to be defined
and the oracle result can be formally applied for any $\theta$. However, in our two
basic examples of nonparametric regression and linear functional estimation
such values are defined in a natural way. The quality of estimation of this
value $\theta$ can be easily derived from the oracle bound (3.3). We present the
corresponding result about the risk of the adaptive estimate $\hat\theta$ for the special
case with $r = 1$. The other values of $r$ can be considered as well; one only
has to update the constants depending on $r$. We also assume that $\alpha \le 1$.

Corollary 3.7. Let $k^*$ be the largest $k$ with $\Delta_k \le \Delta$. Then

$$v_{k^*}^{-1/2}E|\hat\theta - \theta| \le 2\sqrt{e^{\Delta}} + \sqrt{2\mathfrak{z}_{k^*}}.$$

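Proof. Just observe that, by the triangle inequality,

$$E|\hat\theta - \theta| \le E|\hat\theta - \tilde\theta_{k^*}| + E|\tilde\theta_{k^*} - \theta|,$$

where Lemma 3.1 with $s = 2$ and (2.2) bound the last term by $(e^{\Delta}v_{k^*})^{1/2}$,
and the result follows from Theorem 3.6 in view of $C_1 = 1$. □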

Leading term in the risk. The risk bound in the presented oracle result
consists of two terms. The first one, $\sqrt{\alpha C_r e^{\Delta}}$, is just a constant. Moreover, by
choosing a small $\alpha$, one can make this term arbitrarily small. The other term
$(2\mathfrak{z}_{k^*})^{r/2}$ is by the bound of Theorem 3.1 of order $\log K$ and thus, under
the assumption (MD), it is logarithmic in the sample size. This implies
that asymptotically, as the sample size increases, the leading term in the
risk bound is exactly the value $(2\mathfrak{z}_{k^*})^{r/2}$. This particularly explains why the
choice of possibly small critical values is an important issue.

Payment for adaptation. Recall that in the parametric situation, the risk
$E|v_{k^*}^{-1}(\tilde\theta_{k^*} - \theta)^2|$ of $\tilde\theta_{k^*}$ is bounded by $C_1 = 1$; see (2.2). In the nonparametric
situation, the result is only slightly worse. The risk bound includes $\sqrt{2\mathfrak{z}_{k^*}}$,
which can be logarithmic in the sample size. In addition, it bounds the
absolute loss $|\hat\theta - \theta|$ instead of the squared loss. Finally, there is an additional
factor $\sqrt{e^{\Delta}}$, which accounts for the use of a wrong parametric model instead
of the real one.

3.6. SMB condition versus "bias-variance trade-off." The standard approach
for selecting the optimal index $k$ is based on balancing an upper
bound $\bar b_k$ for the bias $b_k = \theta_k - \theta$ and the standard deviation $v_k^{1/2}$ of the
estimate $\tilde\theta_k$; see, for example, Lepski, Mammen and Spokoiny (1997) or
Goldenshluger (1998) for a related discussion. This section shows that under some
additional technical assumptions this approach is nearly equivalent to the
SMB condition advocated in this paper.

In addition to (MD) we suppose the following property of the covariance
matrices $B_k = \mathrm{Cov}(\tilde\theta(k))$. Let $B_{k,\mathrm{diag}}$ be the diagonal matrix with the same
diagonal entries $v_1, \dots, v_k$ as $B_k$. Define also $D_k = B_k^{1/2}$ and $D_{k,\mathrm{diag}} = B_{k,\mathrm{diag}}^{1/2}$.
The required condition reads as follows:

$(D_k)$ It holds for some constant $s$ and all $k \le K$

$$D_k^{-1} \preceq s\,D_{k,\mathrm{diag}}^{-1}.$$

Here the notation $A \preceq sB$ for two symmetric matrices $A$, $B$ means that $|Av| \le
s|Bv|$ for any vector $v$. If $B$ is invertible, this is equivalent to saying that the
maximal eigenvalue of the matrix $B^{-1}A^2B^{-1}$ is bounded by $s^2$.

Condition $(D_k)$ allows one to rewrite the SMB condition $|D_k^{-1}b(k)|^2 \le \Delta$ in
the following form:

(3.4) $$|D_{k,\mathrm{diag}}^{-1}b(k)|^2 = b_1^2/v_1 + \cdots + b_k^2/v_k \le \Delta/s^2.$$

Let $\bar b_k$ be a monotonically increasing upper bound for $|b_k|$: $\bar b_k = \max_{l \le k}|b_l|$.
For the considered problem of pointwise estimation, the bias-variance trade-off
is usually written in the form

(3.5) $$\bar b_{k^*} \le C_b v_{k^*}^{1/2}$$

for some fixed constant $C_b$; see Lepski, Mammen and Spokoiny (1997). The
next result shows that this relation implies the SMB condition (3.4).



Theorem 3.8. Suppose (MD) and $(D_k)$. Then for the index $k^*$ defined
by the balance relation (3.5), the SMB condition $\Delta_{k^*} \le \Delta$ is also fulfilled
with $\Delta = s^2C_{u_0}C_b^2$.

Proof. Let $k$ be such that $\bar b_k \le C_bv_k^{1/2}$. Then

$$b_1^2/v_1 + \cdots + b_k^2/v_k \le \bar b_k^2(v_1^{-1} + \cdots + v_k^{-1}) \le C_{u_0}\bar b_k^2v_k^{-1},$$

where $C_{u_0} = (1 - u_0^{-1})^{-1}$. Now condition $(D_k)$ provides

$$|D_k^{-1}b(k)|^2 \le s^2|D_{k,\mathrm{diag}}^{-1}b(k)|^2 \le s^2C_{u_0}\bar b_k^2v_k^{-1} \le s^2C_{u_0}C_b^2$$

and the assertion follows. □
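
The geometric-sum step of this proof is easy to verify numerically. The sketch below assumes that (MD) holds in the form $v_{k-1}/v_k \ge u_0 > 1$; the variances, the biases and the constant $C_b$ are hypothetical values chosen only to exercise the inequality.

```python
import numpy as np

u0 = 2.0                                   # assumed ratio v_{k-1} / v_k
K = 10
v = 100.0 * u0 ** -np.arange(K)            # variances v_1 > ... > v_K
b = 0.05 * 1.5 ** np.arange(K)             # hypothetical biases b_k
b_bar = np.maximum.accumulate(np.abs(b))   # envelope max_{l <= k} |b_l|

C_u0 = 1.0 / (1.0 - 1.0 / u0)
lhs = np.cumsum(b ** 2 / v)                # b_1^2/v_1 + ... + b_k^2/v_k
rhs = C_u0 * b_bar ** 2 / v                # C_{u0} * b_bar_k^2 / v_k

C_b = 1.0                                  # assumed balance constant
balanced = np.nonzero(b_bar <= C_b * np.sqrt(v))[0]
k_star = int(balanced.max()) + 1           # 1-based index from the balance relation

print(bool(np.all(lhs <= rhs)))
print(k_star, lhs[k_star - 1], C_u0 * C_b ** 2)
```

At the balanced index the diagonal sum stays below $C_{u_0}C_b^2$, which is the quantity that condition $(D_k)$ then inflates by the factor $s^2$.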



Combination of the results of Theorem 3.8 and Corollary 3.7 yields the 
following. 

Corollary 3.9. Suppose (MD) and $(D_k)$ and let the index $k^*$ be defined
by the balance relation (3.5). Then for $\Delta = s^2C_{u_0}C_b^2$ and any $r > 0$



We conclude this section with a short discussion of the relation between
the oracle result and the minimax rate of convergence. Most of the theoretical
results in the statistical literature are stated in terms of the asymptotic minimax
rate of estimation on functional classes. See, for example, Lepski (1990,
1992) and Lepski, Mammen and Spokoiny (1997) for pointwise regression
estimation or Goldenshluger (1999) and Goldenshluger and Pereversev (2003)
for some results in the linear inverse problem. The rate optimal procedures
can be obtained using the bias-variance relation (3.5). An immediate corollary
of Theorem 3.8 is that the proposed adaptive estimate that selects one
out of the family of the spectral cut-off estimates $\tilde\theta_k$ is rate optimal (up to a
logarithmic multiplier) for all such set-ups, because it also achieves the
accuracy corresponding to the balance relation. A precise formulation of this
result lies beyond the focus of this paper.

3.7. Application to the "sequence space" model. This section specifies 
the general results to the sequence space example considered in Section 1.3. 

In this case, $\tilde\theta_k = Y_1 + \cdots + Y_{m_k}$, $v_k = \sigma_1^2 + \cdots + \sigma_{m_k}^2$ with $m_1 > m_2 > \cdots >
m_K \ge 1$. We additionally assume that $\sigma_i^2$ are monotonically increasing in $i$.
The condition (MD) means in this situation that the indices $m_k$ properly
decrease to provide an exponential decrease of the sums $v_k$ in $k$. The next
result shows that this condition ensures $(D_k)$.

Lemma 3.2. For the model (1.3), the condition {MD) implies (Dk) with 
the constant s = (1 — l/uo)~^/^. 

The proof is given in the Appendix. The estimate $\tilde\theta_k$ has the bias $b_k =
\theta_k - \theta = -\sum_{i > m_k}\mu_i$. The bias-variance relation (3.5) balances the
nondecreasing envelope $\bar b_k = \max_{l \le k}|b_l|$ with the variance, leading to the oracle
choice $k^*$. Corollary 3.9 ensures for the adaptive estimate $\hat\theta$ the accuracy of
order $v_{k^*}^{1/2}$ up to the multiplicative factor $\sqrt{2\mathfrak{z}_{k^*}}$.
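
Condition $(D_k)$ can be explored numerically in a nested partial-sum model of this type. The following sketch rests on assumptions that are not taken from the paper: the variances form an exact geometric sequence, the covariance of two nested sums is taken to be the smaller variance, and the helper name `smallest_s2` is hypothetical. It computes the smallest constant $s^2$ for which $|D_k^{-1}b|^2 \le s^2|D_{k,\mathrm{diag}}^{-1}b|^2$ holds for all $b$, namely the reciprocal of the smallest eigenvalue of the correlation matrix of the estimates, and does not attempt to reproduce the constant of Lemma 3.2.

```python
import numpy as np

def smallest_s2(v):
    """Smallest s^2 with b^T B^{-1} b <= s^2 * sum(b_l^2 / v_l) for all b.

    B has entries v_max(j, l) (nested sums with decreasing variances);
    M is the correlation matrix D_diag^{-1} B D_diag^{-1}.
    """
    idx = np.arange(len(v))
    B = v[np.maximum.outer(idx, idx)]
    d = 1.0 / np.sqrt(v)
    M = d[:, None] * B * d[None, :]
    return 1.0 / np.linalg.eigvalsh(M).min(), B

v = 4.0 ** -np.arange(6)            # hypothetical variances, v_{k-1}/v_k = 4
s2, B = smallest_s2(v)

rng = np.random.default_rng(1)
b = rng.normal(size=len(v))         # arbitrary bias vector for the check
lhs = b @ np.linalg.solve(B, b)     # |D_k^{-1} b|^2
rhs = s2 * np.sum(b ** 2 / v)       # s^2 * |D_{k,diag}^{-1} b|^2
print(s2, bool(lhs <= rhs))
```

For a geometric sequence the correlation matrix has entries $\rho^{|j-l|}$ with $\rho = (v_k/v_{k-1})^{1/2}$, so the constant stays bounded as the number of estimates grows.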

4. Simulation. This section illustrates the performance of the proposed
procedure by means of two simulated examples. The first corresponds to a
severely ill-posed inverse problem with exponentially increasing variances $\sigma_i^2$
and the second to a regularly ill-posed problem with polynomially increasing
values $\sigma_i^2$. We focus on two important features of our procedure: the
"propagation" property and "adaptivity." The "propagation" property means that
the selected index is smaller than the oracle one only in very few cases, that
is, the "false alarm" situation, when the procedure stops but the modeling
bias is still small, is very rare. "Adaptivity" means that the ratio
of the risk of the adaptive estimate to the risk of the oracle one is bounded
by some fixed constant.

Table 1
Critical values $\mathfrak{z}_k$ (columns z_1, ..., z_19) computed under the null hypothesis from 50,000 replications, when $K = 20$
and $(\sigma_i = (n^{1/n})^i)_{i=1,\dots,n}$, using the sequential procedure

    r    α   z_1   z_2   z_3   z_4   z_5   z_6   z_7   z_8   z_9  z_10  z_11  z_12  z_13  z_14  z_15  z_16  z_17  z_18  z_19
  0.5  1.0  15.5  13.0  12.8  12.2  11.5  11.3  10.9   9.8   9.2   8.6   8.3   7.6   7.0   6.6   5.9   5.2   4.5   3.6   2.5
  1.0  1.0  22.5  19.0  16.4  17.2  16.2  15.6  16.8  14.4  13.4  13.2  12.9  11.9  10.2   9.3   8.3   7.3   5.8   4.7   3.4

For simplicity we consider "sequence space" models, that is, the data $Y_i$
are generated by the following model: $Y_i = \mu_i + \sigma_i\delta\varepsilon_i$, for $i = 1, \dots, n$ with
$n = 50$, and we assume that $\varepsilon_i$ are i.i.d. standard normal. In each example
the values $(\mu_i)_{i=1,\dots,n}$ are generated randomly from a centered Gaussian with
a decreasing variance and we consider 10 different models of this type.
The error level 6 is equal to 10~^, 10~^ or 10~^. In every example, the target 
is the sum of the parameters $\mu_i$, that is, $\theta = \sum_{i=1}^{n}\mu_i$. This set-up was kindly
suggested by F. Bauer; see, for example, Bauer (2007).

We apply the proposed procedure to the family of "weak" estimates $\tilde\theta_k =
\sum_{i=1}^{m_k}Y_i$. Our default choice of the "metaparameters" $\alpha$ and $r$ is $\alpha = 1$ and
$r = 1/2$. We also report the similar results for $r = 1$, which illustrate that the
critical values slightly increase with $r$. More numerical results (not reported
here) indicate that the critical values increase with $r$ and decrease with
$\alpha$; however, the final results are rather insensitive to the choice of these
metaparameters.

In the first example we choose $\sigma_i = a^i$ for $i = 1, \dots, n$, where $a = n^{1/n}$.
We consider the estimates $\tilde\theta_k = \sum_{i=1}^{m_k}Y_i$ with $m_k = [n - 2(k - 1)]$, for
$k = 1, \dots, K$ and $K = 20$, so that $m_K = 12$.

The critical values are computed from 50,000 Monte Carlo replications 
from the null hypothesis (pure noise model) using the sequential procedure 
from Section 2.2, see Table 1. 

Figure 1 compares the results for our adaptive estimate with the oracle
one. The oracle value $k^*$ is defined as $\max\{k : \Delta_k \le 1\}$. The results for other
values of $\Delta$, for example, $\Delta = 0.5$ or $\Delta = 2$, are very similar and we do not
report them here. Each row corresponds to a different level of the noise $\delta$.
Panel (a) draws the ratio of the adaptive risk $E|\hat\theta - \theta|$ obtained from 500
realizations to the corresponding oracle risk $E|\tilde\theta_{k^*} - \theta|$ for the 10 different
models. In panel (b) we show the box-plot of $\hat k$ from 500 replications
and the "oracle" values $k^*$ (triangles) for the 10 different models described
above. One can see that the adaptive risk is in most cases not more
than twice as large as the oracle risk. The oracle choice $k^*$ is usually
smaller than the adaptively selected $\hat k$, which illustrates the "propagation"
property: the procedure does not stop until $k^*$. It is also worth noticing that both
the oracle choice $k^*$ and the adaptive values $\hat k$ decrease with the noise, that
is, the smaller the noise, the more coefficients $Y_i$ are taken for estimating
the sum $\theta = \sum_i \mu_i$.

Table 2
Critical values $\mathfrak{z}_k$ (columns z_1, ..., z_14) computed under the null hypothesis from 50,000 replications, when
$K = 15$ and $(\sigma_i = i^{\ldots})_{i=1,\dots,n}$, using the sequential procedure

    r    α   z_1   z_2   z_3   z_4   z_5   z_6   z_7   z_8   z_9  z_10  z_11  z_12  z_13  z_14
  0.5  1.0   5.5   5.0   4.6   4.3   4.1   3.9   3.4   3.1   2.8   2.6   2.2   1.7   1.3   0.9
  1.0  1.0   8.1   7.9   6.4   6.6     7   5.8   4.8   4.3   3.9   3.6   3.0   2.0   1.5   1.0


In the second example we consider a model with $(\sigma_i = i^{\ldots})_{i=1,\dots,n}$ and
apply the estimates $\tilde\theta_k = \sum_{i=1}^{m_k}Y_i$ with $m_k = [n/(2^{1/5})^{k-1}]$, for $k = 1, \dots, K$
and $K = 15$, leading to $m_K = 7$. The critical values $\mathfrak{z}_k$ are computed from
50,000 Monte Carlo replications under the null hypothesis, see Table 2.
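
The two index sequences used in the examples can be generated with a few lines of code. This is a sketch under assumptions: the square brackets are read as the integer part, and the base $2^{1/5}$ in the second sequence is a reading of the formula that is adopted here because it reproduces the stated value $m_K = 7$ for $n = 50$ and $K = 15$. The function names are hypothetical.

```python
import math

def indices_example1(n=50, K=20):
    """m_k = n - 2 (k - 1), k = 1, ..., K (first example)."""
    return [n - 2 * (k - 1) for k in range(1, K + 1)]

def indices_example2(n=50, K=15):
    """m_k = floor(n / (2^(1/5))^(k-1)), k = 1, ..., K (assumed base 2^(1/5))."""
    return [math.floor(n / 2 ** ((k - 1) / 5)) for k in range(1, K + 1)]

m1 = indices_example1()
m2 = indices_example2()
print(m1[0], m1[-1])   # first and last index of the first example
print(m2[0], m2[-1])   # first and last index of the second example
```

Both sequences are strictly decreasing, as required by the ordering $m_1 > m_2 > \cdots > m_K \ge 1$ of the sequence space model.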

Figure 2 presents the results comparing the performance of the adaptive 
and oracle estimates in the second example. The set-up is the same as in 
the first example and the results are very similar. 

We conclude from this simulation study that the performance of the 
method is completely in agreement with the theoretical conclusions and the 
procedure demonstrates quite reasonable performance in all the examples 
including regular and severely ill-posed problems and for different configu- 
rations of the signal and different noise levels. 

APPENDIX. 

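We start with a useful technical result. Let $(\xi_1, \xi_2)$ be a Gaussian
vector with zero mean, $E\xi_1^2 = E\xi_2^2 = 1$ and $\rho = E\xi_1\xi_2$. The correlation
coefficient $\rho$ uniquely describes the joint distribution of $\xi_1$ and $\xi_2$, which enables
us to define for $r > 0$ and $\mathfrak{z} > 0$

$$Q_r(\rho, \mathfrak{z}) = E[|\xi_1|^{2r}\,\mathbf{1}(\xi_2^2/2 > \mathfrak{z})], \qquad Q_r(\mathfrak{z}) = \sup_{\rho} Q_r(\rho, \mathfrak{z}).$$

Below we utilize some simple bounds on the quantities $Q_r(\rho, \mathfrak{z})$ and $Q_r(\mathfrak{z})$.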

Lemma A.1. For any $r > 0$ and any $\mathfrak{z} \ge 1$

$$Q_r(\mathfrak{z}) \le \{C_1(r) + C_2(r)\mathfrak{z}^{r}\}\,\mathfrak{z}^{-1/2}e^{-\mathfrak{z}},$$

where $C_1(r)$ and $C_2(r)$ depend on $r$ only. Moreover, for any $\mathfrak{z} \ge 1$

$$\inf_{\rho} Q_r(\rho, \mathfrak{z}) \ge C_3(r)\,\mathfrak{z}^{-1/2}e^{-\mathfrak{z}}.$$

Fig. 1. The result for the first example for the three noise levels $\delta$ (top, middle and
bottom rows). Left: the ratio of the adaptive risk $E|\hat\theta - \theta|$ to the oracle risk $E|\tilde\theta_{k^*} - \theta|$
as a function of the model. Right: the boxplots of the adaptive index $\hat k$ based on 500 runs.
The triangles show the oracle values $k^*$.

Fig. 2. The result for the second example for the three noise levels $\delta$ (top, middle and
bottom rows). Left: the ratio of the adaptive risk $E|\hat\theta - \theta|$ to the oracle risk $E|\tilde\theta_{k^*} - \theta|$
as a function of the model. Right: the boxplots of the adaptive index $\hat k$ based on 500 runs.
The triangles show the oracle values $k^*$.



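Proof. Represent $\xi_1$ as $\rho\xi_2 + \bar\rho\varepsilon_1$, where $\bar\rho$ fulfills $\rho^2 + \bar\rho^2 = 1$ and $\varepsilon_1$ is
standard normal and independent of $\xi_2$. Note that

$$Q_r(\rho, \mathfrak{z}) = E|\rho\xi_2 + \bar\rho\varepsilon_1|^{2r}\,\mathbf{1}(\xi_2^2/2 > \mathfrak{z})
= 0.5E|\rho\xi_2 + \bar\rho\varepsilon_1|^{2r}\,\mathbf{1}(\xi_2^2/2 > \mathfrak{z}) + 0.5E|\rho\xi_2 - \bar\rho\varepsilon_1|^{2r}\,\mathbf{1}(\xi_2^2/2 > \mathfrak{z}).$$

One can easily see that there are constants $c_1(r), c_1'(r) > 0$ such that for any
$x$ and $y$

$$c_1'(r)\{|\rho x|^{2r} + |\bar\rho y|^{2r}\} \le |\rho x + \bar\rho y|^{2r} + |\rho x - \bar\rho y|^{2r} \le c_1(r)\{|\rho x|^{2r} + |\bar\rho y|^{2r}\}.$$

It is straightforward to check that for some other constants $0 < c_2'(r) < c_2(r)$,
$0 < c_3'(r) < c_3(r)$ and $\mathfrak{z} \ge 1$

$$c_2'(r)\,\mathfrak{z}^{r-1/2}e^{-\mathfrak{z}} \le E|\xi_2|^{2r}\,\mathbf{1}(\xi_2^2 > 2\mathfrak{z}) \le c_2(r)\,\mathfrak{z}^{r-1/2}e^{-\mathfrak{z}},$$

$$c_3'(r)\,\mathfrak{z}^{-1/2}e^{-\mathfrak{z}} \le E\,\mathbf{1}(\xi_2^2 > 2\mathfrak{z}) \le c_3(r)\,\mathfrak{z}^{-1/2}e^{-\mathfrak{z}}.$$

Simple algebra now yields

$$Q_r(\rho, \mathfrak{z}) \le 0.5c_1(r)E\{|\rho\xi_2|^{2r} + |\bar\rho\varepsilon_1|^{2r}\}\,\mathbf{1}(\xi_2^2 > 2\mathfrak{z})
\le 0.5c_1(r)\{c_2(r)\mathfrak{z}^{r} + c_3(r)C_r\}\,\mathfrak{z}^{-1/2}e^{-\mathfrak{z}},$$

$$Q_r(\rho, \mathfrak{z}) \ge 0.5c_1'(r)E\{|\rho\xi_2|^{2r} + |\bar\rho\varepsilon_1|^{2r}\}\,\mathbf{1}(\xi_2^2/2 > \mathfrak{z}) \ge C_3(r)\,\mathfrak{z}^{-1/2}e^{-\mathfrak{z}}$$

as required. □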
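
The order $\mathfrak{z}^{-1/2}e^{-\mathfrak{z}}$ of the Gaussian tail used in this proof can be checked directly. The sketch below relies on the identity $P(\xi^2/2 > \mathfrak{z}) = \operatorname{erfc}(\sqrt{\mathfrak{z}})$ for a standard normal $\xi$ and on the classical bounds for the complementary error function; the function names `tail_prob` and `tail_ratio` and the grid of test points are choices made here for illustration.

```python
import math

def tail_prob(z):
    """P(xi^2 / 2 > z) = P(|xi| > sqrt(2 z)) = erfc(sqrt(z)) for standard normal xi."""
    return math.erfc(math.sqrt(z))

def tail_ratio(z):
    """Ratio of the exact tail probability to z^{-1/2} e^{-z}."""
    return tail_prob(z) / (z ** -0.5 * math.exp(-z))

# For z >= 1 the ratio stays between two positive constants.
for z in [1.0, 2.0, 5.0, 10.0, 50.0]:
    print(z, tail_ratio(z))
```

The ratio increases with $\mathfrak{z}$ and approaches $1/\sqrt{\pi}$, so both the upper and the lower two-sided bound on $E\,\mathbf{1}(\xi_2^2 > 2\mathfrak{z})$ hold with explicit constants on $\mathfrak{z} \ge 1$.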

Proof of Theorem 3.1. Define for every m <k < K the random set 



dcf 



mk 



and 



{Of: = 9m}- The definition of the procedure implies 

m 

1=1 



^oKHOk - ekf\'i{Bmk) < E ^oKHOk - 9mf\^i{vf\ei - ekf/2 > ^i). 

1=1 

Define for I <m< k 



1/2 



def TTi t t 
Plmk — ^Oi.lkl;mk- 



Vim. = Var(6li - 9m), tm = {Ol - ^m)/Vim, , 

The conditions of the theorem imply that vim < ivi for all I <m. Therefore 



1=1 



and 



fe-i 



mk I 



m=l 
k—1 m 



Vk 



Qr{pimk-,Uh) 



m=l 1=1 



1=1 



m=l 



Vm 
Vk 
Condition (MD) implies that



k~l 

E 

m=l 



< 



'Euo^""'^<C^(uo) 

m=l 



where C(uo) = (1 — Uq ) . This and Lemma A.l yield 



fc-i 



1=1 

k-l 



< C(r, 7, uo) ^ exp{ -3;/7 + r log(fi/ffc) + rlog(3/)}. 
1=1 

and it remains to check that the choice 3/ = ailog{K) +7log(a~^) + r^x 
log{vi/vK) with a properly selected oi = ai(r,7,uo,u) provides in view of 
{k — /)log(uo) < log(t;//t;fc) <{k — Z)log(u) the required bound Eo|t^^^(^fc — 
^fc)^r ^ CKCr for a\[ k <K and Theorem 3.1 follows. □ 

Proof of Theorem 3.2. We use again the decomposition 

K-l 



0\Vk{ 



ex) 



2\r 



Y,^o\v-K\Ok-eK?Yi{k = k) 



k=l 



for any k < K. The definition of k implies in the considered case with 31 
• • • = 3/s_i = 00 that 

i(k = k) = i{v^\ek+i - Bkfn > dk)- 

With p = pk,k+i,K = ^oCk,k+iCk,K it holds 

= (vfc,i^/v/^)^Eo|4,i^P''l(^fc^fc+i/2 > ikVk/vk,k+i) 

= {vk,K/vK)''Qr (P, ikVk/vk,k+l)- 

The propagation condition (2.4) implies now that 

log(aCr) >r\og{Vk,K/vK)+'^OgQr{p,lkVk/Vk,k+l) 

yielding in view of Lemma A.l that 



Vk,k+1 
Vk 

with some fixed constant Const, depending on r only. □ 



ik > —^-^{rlog{vk,K/vK) + '^oga ^ - Const. log(l + log(i;fc,i^/7;i^))} 

Vk 
Proof of Lemma 3.2. It suffices to show that the minimal eigenvalue
of the matrix Mfc = D^^^-^gB^D^ 

diag bounded away from zero, or, equiv- 
alently, the largest eigenvalue of is bounded from above: ||M^^||oo < 
(1 - l/uo)~^. Clearly EoOjOi = EioOf = vi for j < I, and Mfc is the symmet- 

1/2 — — 

ric matrix composed by the elements of the form pji = v- ^o6j9i = 

{vj/viY^"^ for j <l. In other words, is the covariance matrix for the set 
of random variables r]i = Oi/vj^"^ for I = 1, . . . ,k. 

]^/2 — — 1/2 

Define = {6i — for I < k and Jk = rjk- The random 

variables 7; are independent zero mean normal with the variance si *== 
'Ejf = v^^{vi — for / < k and = 1. The condition (MD) implies 

for alll<k that (1 - I/uq) < s« < (1 - 1/u). Define 7^^) = (71, . . . ,7^)"^ and 
■qif') = (^jj-^^ , , , jVk)'^ ■ The identities 'ji = iji — r]i+i{vi+i/viy^'^ for / < A: can 
be written as 7^^^^ = Akij^''^ , where line / of the matrix only contains 

1 /2 1/2 

only two nonzero entries: ai^i = 1 and az,/+i = — for / = 1, . . . , A; — 1. 
Again, the condition (MD) implies that ||/ — j4fc||oo < 1/uo and ||^^"'^||oo = 
||{/ - (/ - < (1 - 1/uo)"^ Similar bound holds for Aj . Obviously 

Eo7('=)(7('=))T = A = diag(si, . . . , Sfc). This yields 

rk = EAkr]^''\r]^''YAj = AkMkAl 

and ||M^"^||oo < Mfe • |lA~^lloo < (1 - l/uo)~^, then the result follows. 
□ 